Add node pfail and fail count to cluster info metrics #1910

hpatro · 2025-04-02T21:00:31Z

New fields in CLUSTER INFO:

* `cluster_nodes_pfail`
* `cluster_nodes_fail`
* `cluster_voting_nodes_pfail`
* `cluster_voting_nodes_fail`

I'm running a few tests and trying to capture the counts of partially failed (PFAIL) and completely failed (FAIL) nodes. Slot-level partially failed / completely failed stats already exist, but it is more difficult to assess the node failure count from them.

New output:

> CLUSTER INFO
cluster_state:fail
cluster_slots_assigned:0
cluster_slots_ok:0
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_nodes_pfail:1
cluster_nodes_fail:0
cluster_voting_nodes_pfail:1
cluster_voting_nodes_fail:0
cluster_known_nodes:3
cluster_size:0
cluster_current_epoch:1
cluster_my_epoch:1
cluster_stats_messages_ping_sent:2104
cluster_stats_messages_pong_sent:1906
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:4011
cluster_stats_messages_ping_received:1906
cluster_stats_messages_pong_received:1964
cluster_stats_messages_received:3870
total_cluster_links_buffer_limit_exceeded:0
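
For anyone consuming these fields from a script, here is a minimal sketch of pulling the four new counters out of the `key:value` text shown above. The function names (`parse_cluster_info`, `failure_counts`) are hypothetical helpers, not part of Valkey or any client library, and the sample is just a trimmed copy of the output in this description.

```python
# Hypothetical helpers for reading the new CLUSTER INFO fields.
# Assumes the reply is plain text with one `key:value` pair per line.

NEW_FIELDS = (
    "cluster_nodes_pfail",
    "cluster_nodes_fail",
    "cluster_voting_nodes_pfail",
    "cluster_voting_nodes_fail",
)

SAMPLE = """\
cluster_state:fail
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_nodes_pfail:1
cluster_nodes_fail:0
cluster_voting_nodes_pfail:1
cluster_voting_nodes_fail:0
cluster_known_nodes:3
"""


def parse_cluster_info(text: str) -> dict:
    """Turn CLUSTER INFO text into a dict of string values."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not key:value (e.g. a prompt).
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def failure_counts(text: str) -> dict:
    """Return the four node failure counters as ints (0 if a field is absent)."""
    info = parse_cluster_info(text)
    return {name: int(info.get(name, 0)) for name in NEW_FIELDS}


if __name__ == "__main__":
    print(failure_counts(SAMPLE))
```

Defaulting missing fields to 0 is a choice made here so the same script also runs against servers that predate this change; treat it as an assumption rather than recommended behavior.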

Signed-off-by: Harkrishn Patro <[email protected]>

zuiderkwast

Makes sense.

So the use case is to be able to write tests more reliably, or is there a "real" use case?

src/cluster_legacy.c

tests/unit/cluster/info.tcl

hpatro · 2025-04-03T16:35:16Z

> Makes sense.
>
> So the use case is to be able to write tests more reliably, or is there a "real" use case?

I'm trying to observe the time it takes to first detect a node failure and the time it takes to mark the node as completely failed. Without this data, it seems difficult to modify the algorithm and observe the change in behavior.
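
A rough sketch of the kind of measurement described above: poll CLUSTER INFO and record when `cluster_nodes_pfail` and then `cluster_nodes_fail` first become non-zero. Everything here is illustrative. `measure_detection` and its parameters are hypothetical, and `fetch_info` stands in for whatever returns the CLUSTER INFO text from the observing node (a client call, a CLI wrapper, and so on).

```python
import time


def measure_detection(fetch_info, clock=time.monotonic, sleep=time.sleep,
                      interval=0.1, timeout=60.0):
    """Poll CLUSTER INFO and return (pfail_at, fail_at) in seconds since start.

    fetch_info: zero-argument callable returning CLUSTER INFO text (assumed).
    Either value is None if that state was never observed before the timeout.
    """
    start = clock()
    pfail_at = None
    fail_at = None
    while True:
        text = fetch_info()
        fields = dict(
            line.split(":", 1) for line in text.splitlines() if ":" in line
        )
        elapsed = clock() - start
        if pfail_at is None and int(fields.get("cluster_nodes_pfail", 0)) > 0:
            pfail_at = elapsed  # first poll that saw a PFAIL node
        if int(fields.get("cluster_nodes_fail", 0)) > 0:
            fail_at = elapsed  # first poll that saw a FAIL node
            break
        if elapsed >= timeout:
            break
        sleep(interval)
    return pfail_at, fail_at
```

The resolution is bounded by the polling interval, so a short PFAIL phase can be missed entirely; in that case `pfail_at` stays `None` even though `fail_at` is set.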

zuiderkwast · 2025-04-03T17:31:03Z

Observability of failure detection. It's a great concept! ;) Yeah it can be useful for users too, not only for us.

hpatro · 2025-04-03T18:10:01Z

The test seems flaky. Looking into it:
https://github.com/valkey-io/valkey/actions/runs/14249542177/job/39938769646

codecov · 2025-04-03T18:46:38Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.99%. Comparing base (f1d8d77) to head (bf9ddf0).
Report is 22 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1910      +/-   ##
============================================
- Coverage     71.03%   70.99%   -0.05%     
============================================
  Files           123      123              
  Lines         65682    65721      +39     
============================================
- Hits          46660    46656       -4     
- Misses        19022    19065      +43

| Files with missing lines | Coverage Δ |
|---|---|
| src/cluster_legacy.c | `86.11% <100.00%> (+0.05%)` ⬆️ |

... and 21 files with indirect coverage changes

tests/unit/cluster/info.tcl

zuiderkwast

LGTM.

@valkey-io/core-team Please ack ( 👍 ) two new fields in CLUSTER INFO.

src/cluster_legacy.c

madolson

Do we think having this information is better than just asking end users to run cluster nodes/shards and count the number of failed/pfail nodes? I'm a little worried about end users alarming on this metric, even though it includes nodes that aren't part of quorum and aren't serving any traffic.

hpatro · 2025-04-08T03:20:59Z

> Do we think having this information is better than just asking end users to run cluster nodes/shards and count the number of failed/pfail nodes? I'm a little worried about end users alarming on this metric, even though it includes nodes that aren't part of quorum and aren't serving any traffic.

With a large cluster, I would prefer not to pull the cluster nodes/shards output and compute this on the client side.
Thinking more about this, I think I would also need the same stats for voting members. Maybe users can alarm on those. 😉
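
For comparison, this is roughly what the client-side alternative looks like: count flags in the CLUSTER NODES text. It is a sketch under a few assumptions, not the server's logic. It assumes the usual space-separated CLUSTER NODES layout (flags in the third field, slots from the ninth field on), the `fail?` / `fail` flag names, and that a "voting" node is a primary owning at least one slot. The node IDs and addresses in the sample are made up.

```python
# Hypothetical client-side count of PFAIL/FAIL nodes from CLUSTER NODES text.

SAMPLE_NODES = """\
aaa1 127.0.0.1:7000@17000 myself,master - 0 0 1 connected 0-5460
bbb2 127.0.0.1:7001@17001 master,fail? - 1700000000000 1700000000000 2 connected 5461-10922
ccc3 127.0.0.1:7002@17002 master,fail - 1700000000000 1700000000000 3 disconnected 10923-16383
ddd4 127.0.0.1:7003@17003 slave,fail? aaa1 1700000000000 1700000000000 1 connected
"""


def count_failed_nodes(cluster_nodes_text: str) -> dict:
    counts = {
        "nodes_pfail": 0,
        "nodes_fail": 0,
        "voting_nodes_pfail": 0,
        "voting_nodes_fail": 0,
    }
    for line in cluster_nodes_text.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # not a complete node line
        flags = set(parts[2].split(","))
        # Assumption: voting node == primary with at least one slot entry.
        is_voting = "master" in flags and len(parts) > 8
        if "fail?" in flags:
            counts["nodes_pfail"] += 1
            if is_voting:
                counts["voting_nodes_pfail"] += 1
        if "fail" in flags:
            counts["nodes_fail"] += 1
            if is_voting:
                counts["voting_nodes_fail"] += 1
    return counts


if __name__ == "__main__":
    print(count_failed_nodes(SAMPLE_NODES))
```

This walks one line per known node on every poll, which is the cost being avoided on large clusters by exposing the counters directly in CLUSTER INFO.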

hpatro · 2025-04-08T18:16:10Z

I've added voting node pfail/fail counts as well. If we decouple voting nodes from data-serving nodes (primaries) within the same architecture in the future, we will have to add two additional metrics (primary_fail / primary_pfail).

@madolson let me know your thoughts.

hpatro added 2 commits April 2, 2025 20:50

Add cluster info metrics for node pfail and fail count

d61d43a

Signed-off-by: Harkrishn Patro <[email protected]>

Fix clang format

56ddb8f

Signed-off-by: Harkrishn Patro <[email protected]>

hpatro requested review from enjoy-binbin and zuiderkwast April 2, 2025 21:02

hpatro changed the title ~~Add cluster info metrics for node pfail and fail count~~ Add node pfail and fail count to cluster info metrics Apr 3, 2025

zuiderkwast reviewed Apr 3, 2025

src/cluster_legacy.c (outdated)

src/cluster_legacy.c (outdated)

tests/unit/cluster/info.tcl (outdated)

Address feedback

81f0e4e

Signed-off-by: Harkrishn Patro <[email protected]>

Fix flaky tests

601445c

Signed-off-by: Harkrishn Patro <[email protected]>

zuiderkwast reviewed Apr 3, 2025

tests/unit/cluster/info.tcl

zuiderkwast approved these changes Apr 3, 2025

src/cluster_legacy.c (outdated)

zuiderkwast added the major-decision-pending label Apr 3, 2025

hpatro added the cluster label Apr 7, 2025

madolson reviewed Apr 8, 2025

Add voting node pfail/fail metric

fc09ebc

Signed-off-by: Harkrishn Patro <[email protected]>

Cleanup

bf9ddf0

Signed-off-by: Harkrishn Patro <[email protected]>

hwware approved these changes Apr 8, 2025

madolson added the needs-doc-pr and major-decision-approved labels and removed the major-decision-pending label Apr 14, 2025

madolson approved these changes Apr 14, 2025

hpatro mentioned this pull request Apr 15, 2025

Add documentation for node pfail/fail cluster info metrics valkey-io/valkey-doc#261

Merged

madolson removed the needs-doc-pr label Apr 15, 2025

madolson merged commit 30dc9a7 into valkey-io:unstable Apr 15, 2025
51 checks passed

madolson added the release-notes label Apr 15, 2025

madolson pushed a commit to valkey-io/valkey-doc that referenced this pull request Apr 15, 2025

Add documentation for node pfail/fail cluster info metrics (#261)

b8e6bcb

Code changes: valkey-io/valkey#1910 Signed-off-by: Harkrishn Patro <[email protected]>

hpatro mentioned this pull request Jun 27, 2025

Support Large Valkey Cluster #2281

Open

15 tasks
